Later: repeat in Python.
Load libraries
Link to the file.
## Parsed with column specification:
## cols(
## number = col_double(),
## game = col_character(),
## release_date = col_character(),
## price = col_double(),
## owners = col_character(),
## developer = col_character(),
## publisher = col_character(),
## average_playtime = col_double(),
## median_playtime = col_double(),
## metascore = col_double()
## )
Convert to date format:
owners is shown in N1 .. N2 format so I will assume that this is minimum and maximum number of owners. I will extract the columns and create corresponding columns with the maximum and the minimum number of players.
temp <- video_games$owners %>% stringr::str_split(., pattern = stringr::fixed(".."))
temp_min <- temp %>% sapply(., "[", 1) %>%
stringr::str_trim(., side = "both") %>%
stringr::str_replace_all(., pattern = ",", "") %>%
as.numeric()
temp_max <- temp %>% sapply(., "[", 2) %>%
stringr::str_trim(., side = "both") %>%
stringr::str_replace_all(., pattern = ",", "") %>%
as.numeric()
video_games <- video_games %>% select(-owners) %>% mutate(min_owners = temp_min,
max_owners = temp_max)Review missing values:
video_games %>% filter(is.na(game)) %>%
summarise(total = n()) %>%
knitr::kable(format = "html",
caption = "Number of rows where game name is missing",
align = "l") %>%
kableExtra::kable_styling()| total |
|---|
| 3 |
video_games %>% filter(is.na(game)) %>%
knitr::kable(format = "html", caption = "Rows with missing game names",
align = "c") %>%
kableExtra::kable_styling()| number | game | release_date | price | developer | publisher | average_playtime | median_playtime | metascore | min_owners | max_owners |
|---|---|---|---|---|---|---|---|---|---|---|
| 808 | NA | 2016-11-01 | NA | NA | Aterdux Entertainment | 0 | 0 | NA | 20000 | 50000 |
| 138 | NA | 2017-02-17 | 8.99 | Jeroen Wimmers | Jeroen Wimmers | 0 | 0 | 76 | 0 | 20000 |
| 3945 | NA | 2018-03-27 | 0.99 | Paleno Games | Paleno Games | 0 | 0 | NA | 0 | 20000 |
Remove missing values for games
Are there any missing both developer and publisher columns?
video_games %>% filter(is.na(developer) & is.na(publisher)) %>%
summarise(total = n()) %>%
knitr::kable(format = "html",
caption = "Number of rows missing developer and publisher names",
align = "c") %>%
kableExtra::kable_styling()| total |
|---|
| 47 |
Only the first 10 rows because the dataset is large, see below results of skimr::skim()
Skim summary statistics
n obs: 26685
n variables: 11
| variable | missing | complete | n | min | max | empty | n_unique |
|---|---|---|---|---|---|---|---|
| developer | 150 | 26535 | 26685 | 1 | 237 | 0 | 16751 |
| game | 0 | 26685 | 26685 | 1 | 212 | 0 | 26610 |
| publisher | 95 | 26590 | 26685 | 1 | 131 | 0 | 13952 |
| variable | missing | complete | n | min | max | median | n_unique |
|---|---|---|---|---|---|---|---|
| release_date | 42 | 26643 | 26685 | 2004-01-11 | 2018-12-31 | 2017-06-09 | 2942 |
| variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|---|
| average_playtime | 9 | 26676 | 26685 | 9.06 | 117.94 | 0 | 0 | 0 | 0 | 5670 | ▇▁▁▁▁▁▁▁ |
| max_owners | 0 | 26685 | 26685 | 211169.95 | 1944091.17 | 20000 | 20000 | 20000 | 50000 | 2e+08 | ▇▁▁▁▁▁▁▁ |
| median_playtime | 12 | 26673 | 26685 | 5.16 | 68.92 | 0 | 0 | 0 | 0 | 3293 | ▇▁▁▁▁▁▁▁ |
| metascore | 23836 | 2849 | 26685 | 71.89 | 11.02 | 20 | 66 | 73 | 80 | 98 | ▁▁▁▂▅▇▆▁ |
| min_owners | 0 | 26685 | 26685 | 91277.87 | 950467 | 0 | 0 | 0 | 20000 | 1e+08 | ▇▁▁▁▁▁▁▁ |
| number | 0 | 26685 | 26685 | 2904.34 | 2377.1 | 1 | 821 | 2356 | 4524 | 8846 | ▇▅▃▃▂▂▁▁ |
| price | 3094 | 23591 | 26685 | 8.95 | 12.16 | 0.49 | 2.99 | 5.99 | 9.99 | 595.99 | ▇▁▁▁▁▁▁▁ |
Some game names are found multiple times:
video_games %>% group_by(game, developer, publisher) %>%
summarise(total = n()) %>% filter(total > 1) %>%
DT::datatable(caption = "Games listed multiple times")video_games %>% group_by(game) %>%
summarise(total = n(),
release_dates = paste(release_date, collapse = ", ")) %>%
filter(total > 1) %>%
DT::datatable(caption = "Duplicated games have more than 1 release date")Spot check for one game on Steam’s website did show different release dates. Here are the screenshots for “18 Wheels of Steel: American Long Haul” from the browsing page (release date is Jan 29, 2008)
browse
and from the game page (release date is Dec 3, 2007).
browse
video_games %>% arrange(release_date) %>%
plot_ly(x = ~release_date, y = ~price, type = "scatter", mode = "markers", text = ~game) %>%
layout(title = "Relationship between release date and the price",
xaxis = list("release date", zeroline = FALSE),
yaxis = list("game price, USD", zeroline = FALSE))## Warning: Ignoring 3126 observations
Seems like there is a trend in terms of the price but not huge. We do observe a few outliers.
Select only the developers that put out at least 15 games for a smaller plot.
video_games %>% filter(!is.na(developer)) %>%
group_by(developer) %>%
summarise(total_games = n_distinct(game)) %>%
filter(total_games > 15) %>%
arrange(total_games) %>%
plot_ly(x = ~developer, y = ~total_games, type = "bar") %>%
layout(title = "Most prolific developers",
xaxis = list(title = list(text = ""),
showspikes = TRUE,
spikedash = "solid",
spikecolor = "red",
showgrid = TRUE,
categoryorder = "array",
categoryarray = ~total_games,
tickangle = 45),
yaxis = list(title = list(text = "games"))) Similar approach to the publishers, but change the game minimum to 50.
video_games %>% filter(!is.na(publisher)) %>%
group_by(publisher) %>%
summarise(total_games = n_distinct(game)) %>%
filter(total_games > 50) %>%
arrange(total_games) %>%
plot_ly(x = ~publisher, y = ~total_games, type = "bar") %>%
layout(title = "Most prolific publishers",
xaxis = list(title = list(text = ""),
showspikes = TRUE,
spikedash = "solid",
spikecolor = "red",
showgrid = TRUE,
categoryorder = "array",
categoryarray = ~total_games,
tickangle = 45),
yaxis = list(title = list(text = "games"))) Hey! Big Fish Games!
How many developers are also publishers and how many games do they have?
video_games %>% filter(!is.na(developer) & !is.na(publisher)) %>%
filter(developer == publisher) %>%
group_by(publisher) %>%
summarise(total_games = n_distinct(game)) %>%
filter(total_games > 15) %>%
arrange(total_games) %>%
plot_ly(x = ~publisher, y = ~total_games, type = "bar") %>%
layout(title = "Most prolific devlopers who are also publishers",
xaxis = list(title = list(text = ""),
showspikes = TRUE,
spikedash = "solid",
spikecolor = "red",
showgrid = TRUE,
categoryorder = "array",
categoryarray = ~total_games,
tickangle = 45),
yaxis = list(title = list(text = "games")))Ownership will be considered as the max number of owners.
video_games %>% arrange(desc(max_owners)) %>% slice(1:100) %>%
plot_ly(x = ~game) %>%
add_trace(y = ~max_owners, name = "max owners", type = "scatter", mode = "markers") %>%
add_trace(y = ~min_owners, name = "min owners", type = "scatter", mode = "markers") %>%
layout(title = "Top 100 most owned games",
xaxis = list(title = list(text = ""),
tickangle = 45),
yaxis = list(side = "left", title = "owners"))video_games %>% arrange(desc(max_owners)) %>% slice(1:100) %>%
plot_ly(x = ~max_owners, y = ~metascore, text = ~game, type = "scatter", mode = "markers") %>%
layout(title = "Top 100 most owned games and their metascore",
xaxis = list(title = "maximum estimated number of owners"),
yaxis = list(title = "metascore"))## Warning: Ignoring 30 observations
Not necessarily! Again, tapping into my husband’s knowledge of video games, he explained that some of the games weren’t available through Steam right away.
Compare average and median time played for each game in the last two weeks (do remove the missing values and 0)
video_games %>% filter(!is.na(average_playtime) & !is.na(median_playtime) &
average_playtime != 0 & median_playtime != 0) %>%
plot_ly(x = ~average_playtime, y = ~median_playtime,
type = "scatter",
mode = "markers",
hoverinfo = "text",
text = ~paste0("average: ", average_playtime,
"\nmedian: ", median_playtime,
"\ngame: ", game)) %>%
layout(title = "Average vs median playtime",
xaxis = list(title = "average over two weeks, min"),
yaxis = list(title = "median over two weeks, min"),
shapes = list(type = "line", x0 = 0, y0 = 0, x1 = 1, y1 = 1,
yref = "paper", xref = "paper",
line = list(color = "red")))The points below the diagonal line have average larger than median which means that a lot more people don’t spend too much time playing the game; the points above the diagonal line have median larger than the average which means that those games have a few very hard core players (they spend a lot of time playing the game).
I have certainly not explored all information in the data. Of course, it would be worthwhile looking at the date of release and the number of minutes playes. Are older games still holding up?
From conversation with my husband it seems that it is valuable to have some domain knowledge about this area and therefore additional variables. For example, information about where the game was realeased first (console?), whether the game is free to play and how many players play a certain game.
For this first time I definitely spent less time had I used ggplot2 library instead of plotly, however, this wasn’t the goal.
I have observed a strange behavior with plotly when I was trying to plot the top 100 games owned by most people and order the x axis by the maximum number of owners:
video_games %>% arrange(desc(max_owners)) %>% slice(1:100) %>%
plot_ly(x = ~game) %>%
add_trace(y = ~max_owners,
name = "max owners",
type = "scatter",
mode = "markers") %>%
add_trace(y = ~min_owners,
name = "min owners",
type = "scatter",
mode = "markers") %>%
layout(title = "Top 100 most owned games",
xaxis = list(title = list(text = ""),
tickangle = 45,
categoryorder = "array",
categoryarray = ~ max_owners),
yaxis = list(side = "left", title = "owners"))Why in this case the x axis labels are cut off and the first 2 values in the x axis are 200,000,000? I also seem to replicate the bahavior when I use categoryarray = ~min_owners, but then I have 100,000,000. I think this is a reproducible example that can be posted as a question on stack overflow.